Team ID: Team 6

NAME: Connor Rosenberg

NAME: Rongkui Han

NAME: Yuqing Yang

NAME: Nassim Ali-Chaouche


Introduction

Background

The Student/Teacher Achievement Ratio (STAR) was a four-year longitudinal class-size study funded by the Tennessee General Assembly and conducted by the State Department of Education. Over 7,000 students from kindergarten to 3rd grade in 79 schools were randomly assigned into one of three interventions: small class (13 to 17 students per teacher), regular class (22 to 25 students per teacher), and regular-with-aide class (22 to 25 students with a full-time teacher’s aide). The interventions were initiated as the students entered school in kindergarten and continued through third grade.

Besides following standard procedures that ensured confidentiality and ethics in human-subjects research, Project STAR had two important design features:

  1. Each school included in the study had to have a large enough student body to form at least one of each of the three class types.

  2. Students and teachers were randomly assigned to their class type.

Project STAR is an example of a stratified randomized design, where experimental units are grouped into strata according to certain pre-treatment characteristics. Within each stratum, a completely randomized experiment is conducted. When there are population structures that associate or covary with the experimental outcome, stratified randomized experiments are generally more informative than completely randomized experiments (Suresh, 2011). The goal of a stratified study is usually not to identify treatment effects within a single stratum, but rather treatment effects across all strata.
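The idea of randomizing within each stratum can be sketched in a few lines of Python. This is only an illustration of a stratified assignment, not the procedure Project STAR actually followed; the function name `assign_within_strata`, the class-type labels, and the toy teacher IDs are all hypothetical.

```python
import random

# Hypothetical labels for the three class types described above.
CLASS_TYPES = ["small", "regular", "regular+aide"]

def assign_within_strata(teachers_by_school, seed=0):
    """Randomly assign teachers to class types separately within each school."""
    rng = random.Random(seed)
    assignment = {}
    for school, teachers in teachers_by_school.items():
        shuffled = list(teachers)
        rng.shuffle(shuffled)  # the randomization happens inside the stratum
        # Cycling through the class types guarantees that every type appears
        # at least once whenever a school has three or more teachers.
        for position, teacher in enumerate(shuffled):
            class_type = CLASS_TYPES[position % len(CLASS_TYPES)]
            assignment[teacher] = (school, class_type)
    return assignment

# Toy example with two schools (strata).
example = {"school_A": ["t1", "t2", "t3", "t4"], "school_B": ["t5", "t6", "t7"]}
result = assign_within_strata(example)
for teacher, (school, class_type) in sorted(result.items()):
    print(teacher, school, class_type)
```

Because the shuffle is carried out school by school, differences between schools cannot be confounded with the class-type assignment in this sketch.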

Because it is reasonable to expect systematic differences in educational outcome across schools, for reasons such as demographics, school could be a confounding variable and influence the outcome of the Project STAR research. Therefore, each school should be viewed as a stratum in analyzing Project STAR data.

To expand on previous findings, teachers, rather than individual students, will be used as experimental units. This change of experimental unit facilitates causal statements about the effect of class size on educational outcome, because teachers as experimental units more convincingly satisfy the stable unit treatment value assumption (SUTVA) and the independence assumption necessary for causal inference. These assumptions are examined in detail in the Discussion section.

Questions of Interest

  1. Is there a significant difference in a first-grade teacher’s teaching performance in math across the three different class sizes?
  2. Is teacher performance relatively stable across schools? That is, does the school itself affect class median math scores?
  3. Does our ANOVA model fit the data well? In other words, are the analysis of variance assumptions satisfied?
  4. Can we draw the causal conclusion that class size affects the class median math scores of first-grade teachers?

Analysis Plan

Population and study design

A two-way ANOVA is well suited to answering our questions of interest under the stratified randomized design. One factor in the ANOVA model will be class size, whose main effect is of primary interest in this study. The other factor will be school ID, in order to control for and observe the stratum effect. Our model is appropriate for the questions of interest because it captures the effect of each treatment on the class's median math performance while controlling for other external factors by blocking on school ID.

In order to treat teachers as the experimental units, we will use the median scaled 1st grade math score of all students under each teacher for our analysis. The median score of a class reflects the class's typical performance. In addition, the median is usually a more robust summary statistic than the mean, because it is less affected by outliers.
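As a minimal sketch of this aggregation step, the snippet below collapses student-level records into one median score per teacher. The record layout (keys `school`, `class_type`, `teacher`, `math`) and the toy scores are assumptions made for illustration; they are not the column names of the STAR dataset.

```python
from collections import defaultdict
from statistics import mean, median

def class_medians(student_rows):
    """Collapse student records into one median math score per teacher."""
    scores = defaultdict(list)
    for row in student_rows:
        if row["math"] is None:  # skip students with a missing score
            continue
        key = (row["school"], row["class_type"], row["teacher"])
        scores[key].append(row["math"])
    return {key: median(values) for key, values in scores.items()}

# Toy records (hypothetical): one class contains an extreme score.
rows = [
    {"school": "s1", "class_type": "small", "teacher": "t1", "math": 500},
    {"school": "s1", "class_type": "small", "teacher": "t1", "math": 510},
    {"school": "s1", "class_type": "small", "teacher": "t1", "math": 520},
    {"school": "s1", "class_type": "small", "teacher": "t1", "math": 900},
    {"school": "s1", "class_type": "regular", "teacher": "t2", "math": 480},
    {"school": "s1", "class_type": "regular", "teacher": "t2", "math": None},
    {"school": "s1", "class_type": "regular", "teacher": "t2", "math": 500},
]
medians = class_medians(rows)
t1_mean = mean([500, 510, 520, 900])
print(medians[("s1", "small", "t1")], t1_mean)
```

In the toy class taught by `t1`, the single extreme score pulls the mean far above the median, which is the robustness property motivating the choice of the median.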

Statistical Analysis

Descriptive Analysis

We will examine the following aspects to characterize the independent and response variables:

  • Completeness and balance of the strata;
  • Distribution of the response variable, median scaled 1st grade math score, across schools (strata);
  • Distribution of the response variable across class sizes.
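The completeness check in the first item can be sketched as follows. The input format, a list of `(school, class_type)` pairs with one pair per teacher, and the class-type labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical labels for the three class types.
CLASS_TYPES = {"small", "regular", "regular+aide"}

def incomplete_strata(teacher_rows):
    """Map each incomplete school to the set of class types it is missing."""
    seen = defaultdict(set)
    for school, class_type in teacher_rows:
        seen[school].add(class_type)
    return {
        school: CLASS_TYPES - types
        for school, types in seen.items()
        if types != CLASS_TYPES
    }

# Toy data: school s2 has no regular-with-aide class.
teacher_rows = [
    ("s1", "small"), ("s1", "regular"), ("s1", "regular+aide"),
    ("s2", "small"), ("s2", "regular"), ("s2", "regular"),
]
missing = incomplete_strata(teacher_rows)
print(missing)
```

A school appears in the result only if at least one class type has no teacher, so an empty dictionary would indicate that every stratum is complete.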

Main Analysis

For our analysis, we will construct the following factor effects model for the classroom median math score.

\[y_{ijk} = μ_{..} + τ_i + β_j + (τβ)_{ij} + ε_{ijk}\]

for \(i \in \{1,2,3\}\) (class type) and \(j \in \{1,2,...,76\}\) (school), subject to the constraints

\[\sum_{i=1}^{3} τ_i = 0\] \[\sum_{j=1}^{76} β_j = 0\]

where:

- \(μ_{..}\) represents the overall classroom median score across all treatment levels.
- \(τ_i\) represents the effect of each class size on the overall median math score.
- \(β_j\) represents the effect of each school on the overall median math score.
- \((τβ)_{ij}\) represents the interaction effect, if any, of each school & class size combination on the overall median math score.
- \(ε_{ijk}\) represents the random error for teacher \(k\) in school \(j\) with class size \(i\).
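
For comparison, the additive (no-interaction) counterpart of this model drops the \((τβ)_{ij}\) term and keeps the same symbols and index ranges. It is written out here only as a reference form:

\[y_{ijk} = μ_{..} + τ_i + β_j + ε_{ijk}\]

Choosing between the two forms amounts to testing \(H_0\): all \((τβ)_{ij} = 0\), using the ratio of the interaction mean square to the error mean square from the full model:

\[F = \frac{MS_{τβ}}{MSE}\]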


Model Diagnostics

We will use a Q-Q plot, a histogram, and the Shapiro-Wilk test to inspect the normality of residuals. A residual-versus-fitted-value scatter plot and the Levene test will be used to examine equality of residual variance. Independence of residuals and outlying data points will also be discussed.

Results

Descriptive Analysis

After filtering out entries with missing data in variables relevant to 1st grade performance, our dataset included median scaled 1st grade math scores from 338 teachers in 76 schools. Seventy-two of the 76 schools had at least one complete set of the three different class sizes (Figure 1a). The four schools with incomplete treatment sets all had both regular classes and small classes. Because the number of incomplete strata was small and each still contained more than one treatment, we decided to retain these schools in the analysis (Kutner et al., 2005, p. 966). Figure 1b, the boxplots of the classroom median scaled math scores for each school, highlights the high variability in teacher performance across schools. The distribution of classroom median scaled math scores differs across class sizes (Figure 1c), with large within-group variance that causes the distributions to overlap with one another.

Figure 1. (a) Results of the proposed descriptive analysis, showing completeness of strata. (b, c) Differences in median scaled 1st grade math score across schools and class types.

Main Analysis

ANOVA Table with Interaction

Term                        Df     Sum Sq    Mean Sq   F value   Pr(>F)
g1_classtype                 2   10796.02    5398.01     17.78    <.001
g1_schoolID                 75  149258.51    1990.11      6.55    <.001
g1_classtype:g1_schoolID   146   46733.74     320.09      1.05     0.39
Residuals                  114   34611.56     303.61
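
As a quick arithmetic check on the table above, each F value is the term's mean square (Sum Sq divided by Df) divided by the residual mean square. The short sketch below recomputes the F column from the reported degrees of freedom and sums of squares; only the variable names are introduced here.

```python
# Degrees of freedom and sums of squares copied from the ANOVA table.
anova_rows = {
    "g1_classtype": (2, 10796.02),
    "g1_schoolID": (75, 149258.51),
    "g1_classtype:g1_schoolID": (146, 46733.74),
}
residual_df, residual_ss = 114, 34611.56

# Residual mean square (MSE) is the denominator of every F ratio.
mse = residual_ss / residual_df

# F = (Sum Sq / Df) / MSE for each term.
f_values = {
    term: (ss / df) / mse
    for term, (df, ss) in anova_rows.items()
}

# The degrees of freedom should add up to (number of teachers - 1).
total_df = sum(df for df, _ in anova_rows.values()) + residual_df

for term, f in f_values.items():
    print(f"{term}: F = {f:.2f}")
print("total df:", total_df)
```

The total of 337 degrees of freedom agrees with the 338 teachers in the dataset, and the interaction F ratio near 1 indicates that the interaction mean square is about the same size as the error mean square.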

Model Diagnostics

Figure 2: Visual diagnostics of ANOVA model assumptions. (a) Normal Q-Q plot of residuals. (b) Histogram of model residuals. (c) Residual-versus-fitted value scatter plot.

Normality:

From the Q-Q plot of the residuals (Figure 2a), we can observe that most of the data points lie on a straight line, which is close to what we expect to see from a normal distribution. There are some points at the right tail which have a higher probability mass than expected; however, these points are few compared to the total amount of data. Thus, judging from the Q-Q plot, the normality assumption is largely satisfied. A histogram is used to visualize the distribution of the residuals (Figure 2b). From the histogram we can observe that the distribution of the residuals looks symmetric about the mean and bell-shaped, similar to what is seen from a normal distribution.

To further test the normality of the errors, a Shapiro-Wilk test, which tests whether a sample follows a normal distribution, is applied to the residuals.

The null and alternative hypotheses of the Shapiro-Wilk test are:
\(H_0\): The residuals are normally distributed.
\(H_1\): The residuals are not normally distributed.

Shapiro-Wilk Normality Test

W Statistic   P-Value
0.99          0.13

The p-value of 0.13 is greater than the significance level of 0.05, and thus we fail to reject the null hypothesis. Thus, there is no evidence that the distribution of the residuals does not follow a normal distribution.

Equal Variances:

The residuals are evenly spread out along the y-axis in a residual-versus-fitted value scatter plot (Figure 2c). Visually, the equal variance assumption is satisfied.

Levene’s test is used to further test the equal variance assumption with respect to each of the two factors.

The null and alternative hypotheses of Levene’s test are:
\(H_0\): The residual variances are equal across groups.
\(H_1\): Not all residual variances are equal across groups.

Levene Test for Class Type Variable

df        F Value   Pr(>F)
2, 335    0.34      0.71

Levene Test for School ID Variable

df        F Value   Pr(>F)
75, 262   0.76      0.92

In Levene’s tests for both factors, the p-values are greater than the significance level of 0.05. Thus, for each factor, there is no evidence that residual variances differ across groups, and the equal variance assumption is considered satisfied.
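
To make the mechanics of this test concrete, the sketch below computes a Levene-type statistic from scratch: it is the one-way ANOVA F ratio applied to the absolute deviations of each value from its group center. Centering at the group median (the Brown-Forsythe variant) is an assumption here, since the report does not state which center was used, and the function name and toy data are hypothetical. The p-value is omitted because it would require the F distribution from a statistics library.

```python
from statistics import mean, median

def levene_statistic(groups, center=median):
    """Return (F, (df1, df2)) for a Levene-type test of equal variances."""
    # Absolute deviations of each value from its own group center.
    z = {
        name: [abs(v - center(values)) for v in values]
        for name, values in groups.items()
    }
    n_total = sum(len(dev) for dev in z.values())
    k = len(z)
    grand_mean = sum(sum(dev) for dev in z.values()) / n_total

    # One-way ANOVA decomposition on the absolute deviations.
    ss_between = sum(len(dev) * (mean(dev) - grand_mean) ** 2 for dev in z.values())
    ss_within = sum((x - mean(dev)) ** 2 for dev in z.values() for x in dev)

    df1, df2 = k - 1, n_total - k
    return (ss_between / df1) / (ss_within / df2), (df1, df2)

# Toy residuals grouped by a factor (hypothetical values).
toy_groups = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]}
f_stat, dfs = levene_statistic(toy_groups)
print(f_stat, dfs)
```

With 338 teachers, the degrees of freedom produced by this formula are (2, 335) for the three class types and (75, 262) for the 76 schools, matching the tables above.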

Independence:

This experiment is randomized in two ways, which gives the strongest available basis for assuming independence of the results. First, teachers are randomly assigned to a class type: small, regular, or regular with aide. Second, every student is randomly assigned to a teacher. Independence may still not hold completely between the time of randomization and the time the test scores were recorded. For instance, teachers may share materials, or parents of the same class may together decide to seek out tutoring for their children. However, given the randomized design of the experiment and the relatively large sample size, we may assume that the results are essentially independent.

Outliers:

An observation is defined as an outlier if its studentized residual is greater than 3 in absolute value.


There is only one outlier in the data, which is observation number 252. Thus, outliers do not pose a problem in our analysis.
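
The outlier rule can be illustrated with a simplified sketch. It assumes a model whose fitted values are plain group means, so that the leverage of an observation is one over its group size, and it uses internally studentized residuals. Both choices are simplifications for illustration and may differ from the two-way model and residual type used in this report; the function name and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean

def flag_outliers(groups, threshold=3.0):
    """Return (group, index, studentized residual) for flagged observations."""
    residuals = {
        name: [v - mean(values) for v in values]
        for name, values in groups.items()
    }
    n_total = sum(len(values) for values in groups.values())
    df_error = n_total - len(groups)
    mse = sum(e * e for res in residuals.values() for e in res) / df_error

    flagged = []
    for name, res in residuals.items():
        leverage = 1.0 / len(res)  # leverage under a group-means model
        if leverage >= 1.0:
            continue  # a singleton group always has a zero residual
        scale = sqrt(mse * (1.0 - leverage))
        for index, e in enumerate(res):
            r = e / scale  # internally studentized residual
            if abs(r) > threshold:
                flagged.append((name, index, r))
    return flagged

# Toy data: group "b" contains one extreme value at index 20.
toy = {"a": [-1.0, 1.0] * 10, "b": [-1.0, 1.0] * 10 + [30.0]}
flagged = flag_outliers(toy)
print(flagged)
```

Only the extreme value in group `b` exceeds the threshold in this toy example, mirroring a situation where a single observation is flagged.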

Tests

Task 6: Test whether there is a difference in math scaled score in 1st grade across teachers in different class types. Justify your choice of test.

Discussion

In this report, we presented our usage of 2-way ANOVA to analyze the effect of class size on first-grade teachers’ teaching performance in math in a stratified randomized experiment, using each school as a stratum.

Stratified Randomized Design

Exploratory analysis highlights the variability in teacher performance across schools. This variability is likely due to demographic features that are similar within schools but differ between them. For example, schools located in areas of high affluence may achieve better classroom performance since more students have access to academic support in addition to greater parent oversight. Similarly, schools that draw their students from less affluent areas may see worse classroom performance due to student food insecurity, lack of academic support, and other social disadvantages. Because of this high variability in median classroom performance between schools, blocking by school ID in our model helps to isolate the effect of the class size treatment on the classroom median performance.

Exclusion of Interaction Terms

We explored the effect of including school-by-class size interactions in our model, and concluded that interactions between the two factors did not contribute significantly to the variance partitioning of the data. A model without interaction was used in all following analyses. Philosophically, excluding interaction terms from the model also fits the purpose of a stratified randomized experiment, because we are not primarily interested in the class size effect within individual schools, but rather its main effect across all schools. Eliminating interaction terms results in fewer parameters to estimate and hence higher power of the test.
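
The gain in error degrees of freedom can be illustrated with the numbers from the ANOVA table with interaction. The sketch below pools the interaction sum of squares into the error term, which is a valid shortcut only under the assumption that the table reports sequential sums of squares with the interaction entered last. It is an arithmetic illustration, not the output of the refitted no-interaction model, whose reported values may differ.

```python
# Values copied from the ANOVA table with interaction.
class_df, class_ss = 2, 10796.02
interaction_df, interaction_ss = 146, 46733.74
residual_df, residual_ss = 114, 34611.56

# Assumption: sequential sums of squares, interaction entered last, so that
# dropping the interaction moves its sum of squares into the error term.
pooled_df = interaction_df + residual_df
pooled_ss = interaction_ss + residual_ss
pooled_mse = pooled_ss / pooled_df

# Class-type F ratio against the pooled error term.
f_class_additive = (class_ss / class_df) / pooled_mse

print("error df:", residual_df, "->", pooled_df)
print(f"pooled MSE: {pooled_mse:.2f}")
print(f"class-type F against pooled error: {f_class_additive:.2f}")
```

Under this assumption the error degrees of freedom rise from 114 to 260 while the error mean square changes only slightly, which is the sense in which dropping the interaction leaves more information for estimating error variance.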

Main Effect

Model diagnostics suggested that the dataset satisfied the assumptions for ANOVA. Results from fitting the model suggested a significant difference in a first-grade teacher’s median math scores across different class sizes. Pairwise comparisons suggested that small classes and regular classes with aides both outperformed regular classes without aides. The model also revealed significant performance differences across schools, with the largest pairwise difference being XXX in class median 1st grade scaled math score.

Causal Inference

This analysis enables us to make causal statements regarding the effect of class size on teachers’ performance in math education. This is made possible by using teachers as experimental units, thus satisfying the SUTVA and independence assumptions necessary for causal inference:

SUTVA:
Definition: The potential outcomes for any unit do not vary with the treatments assigned to other units, and, for each unit, there are no different forms or versions of each treatment level that lead to different potential outcomes.
The experimental unit used in the analysis first satisfies the no-interference component of SUTVA: the assumption that the treatment applied to one unit does not affect the outcome for other units. On the basis of prior knowledge of school systems, it is realistic to assume that one teacher being assigned to a specific class size does not affect the teaching outcome of another teacher. The second component of SUTVA requires that individuals receiving a specific treatment cannot receive different forms of that treatment. In our case, due to the strict randomization implemented in the experiment, the class taught by one teacher is by nature homogeneous with a class taught by another.

Independence Assumption:
Definition: the assignment of treatment is independent of potential outcomes of experimental units.
This assumption is met in the experiment by using double randomization. The first randomization is that of teachers to classes. The second is that of students to classes/teachers. The design ensures that high- or low-performing teachers or students were not systematically enriched in any class-size treatment. In light of this, systematic effects can be interpreted as the effects of class size.

Therefore, our analysis concludes that smaller class size has a positive average causal effect on a teacher’s teaching outcome in math. This is different from the conclusion of Project I. SUTVA was not plausible when using individual students as experimental units. Interactions between students likely altered the potential outcomes of one student depending on the treatment assigned to another, thus violating SUTVA. In that case, rejection of the null hypothesis would not necessarily be convincing evidence of effects of class size; it may simply indicate the presence of peer effects. In contrast, using teachers as experimental units does not rely on no-interference assumptions among students. This makes the results reported here credible evidence of causal class-size effects.

References

Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models. New York: McGraw-Hill.
Suresh, K. (2011). An overview of randomization techniques: An unbiased assessment of outcome in clinical research. Journal of Human Reproductive Sciences, 4(1), 8–11. doi:10.4103/0974-1208.82352